[RFC] Device Array and compat Device Vector #2660
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
| Aspect | Proposed (oneDPL) | Thrust | sycl-thrust | SYCLomatic |
|---|---|---|---|---|
| **Default Allocator** | `device_allocator<T>` wrapping `sycl::malloc_device`; custom `DeviceAllocator` concept | `thrust::device_allocator<T>` (CUDA `cudaMalloc`) | `device_allocator<T>` (`sycl::malloc_device`); supports alignment template parameter | USM: `sycl::usm_allocator<T, shared>` / Buffer: `__buffer_allocator<T>` |
| **Memory Model** | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers | **Device memory** via `cudaMalloc`; host access triggers explicit transfers | **Device memory** via `sycl::malloc_device`; explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement |
Is "host access triggers explicit transfers" accurate? I mean, if something is triggered by another action, it is implicit and not explicit, no?
Rarely used in practice (see [usage study](usage_pattern_study.md)), high implementation complexity for device memory.
- **Host-side operations block but do not synchronize with prior work.**
What does "block" mean in this sentence, and is it important in the context of design decisions?
It means they wait() until the operation is complete before returning, but there is no guarantee that previously queued work finishes before this starts. This is mentioned because it is a departure from thrust::device_vector. I can make this clearer and provide better justification for why this decision was made (it fits better with SYCL, and the usage study shows users' preferences).
- Individual headers: `<oneapi/dpl/device_array>` and `<oneapi/dpl/device_vector>`. `device_vector` would transitively include `device_array` since it depends on it.
- We could have a `compat` header and an individual `device_array` header. However, if we intend to use `device_array` within our own SYCL implementations, that may impact our decision here.
This is not quite clear to me. What do you mean by "within our own sycl implementation", why would we want to use device_array there, and how does that impact the decision about headers?
## `device_span<T>`

`device_array` is not device-copyable (it owns memory). For kernel capture, non-owning views, and range composition, use `device_span<T>` via `.span()`.
I do not think our own span class is needed, because SYCL 2020 says that std::span and sycl::span, if supported by the implementation, must be device copyable. I suggest to remove device_span and the methods that return it.
Thanks for the pointer to this, I can look into this. I'd be happy to just rely on sycl::span.
template <typename T, typename Alloc = device_allocator<T>>
class device_array {
I would like to be convinced that the allocator template parameter is needed for this class. It is needed for device_vector, sure, but why should this simplified container allow for an alternative allocator?
The main argument from my perspective is that if we are building such a system for device_vector to have a "faithful" migration target, we will have and maintain that capability / complexity. If we have it, why not include it, and directly make device_vector depend upon device_array.
I think there are useful examples which can motivate having this feature, and it is easier to add at the beginning than later on: Memory pooling, alignment requirements, bookkeeping / debugging.
However, I'm open to the idea of having a third internal class which implements allocator friendly RAII memory management, and having both public classes hold a member of that internal class. device_array could hardcode a simple device_allocator with no frills. The question becomes whether we think there is enough utility in the feature to justify the API complexity. I don't have a strong opinion here really. The argument against is that the SYCL folks explicitly decided not to include an official usm device allocator.
template <typename InputIt>
device_array(InputIt first, InputIt last, sycl::queue q);
Would it make sense to move towards range-based APIs, instead of iterator-based? A separate constructor from std::vector would not be necessary if the generic one accepts any range. For C++17, we may add a named requirement for a range that matches the C++20 concept, as well as a simple to-range converter for a pair of iterators.
// Device-to-device copy (allocates on the provided context+device)
// Supports cross-device copies: source and destination may be on different devices
static device_array copy_from(const device_array& src, sycl::queue q);
static device_array copy_from(const device_array& src,
                              size_type offset, size_type count, sycl::queue q);
static device_array copy_from(const device_array& src,
                              sycl::context ctx, sycl::device dev);
static device_array copy_from(const device_array& src,
                              size_type offset, size_type count,
                              sycl::context ctx, sycl::device dev);
These functions really look like a combination of the "uninitialized" constructor followed by an assignment, just with an additional offset argument. What are the benefits of those, or why are copy operations not appropriate?
akukanov left a comment
The device_array still seems over-engineered to me. I think it is better to start with a tiny, purposeful API and extend it on demand, rather than trying to include as much as possible from the beginning.
Actually, I would start with the use cases for this container (and code examples showing how it is supposed to be used), and design from those.
template <typename InputIt>
device_array(InputIt first, InputIt last, sycl::queue q);
What is the expected category of iterators/ranges that can be used as the source of data? The name suggests that the requirements are minimal (input_iterator), but then how are the data copied to the device memory underneath the container?
// Resize — new elements are uninitialized by default
void resize(size_type count);
void resize(size_type count, sycl::queue q);
// Resize — new elements filled with value
void resize(size_type count, const T& value);
void resize(size_type count, const T& value, sycl::queue q);
Does resize operate within the container capacity, or do you want it to be truly resizable?
This is thinking with the idea that device_vector needs this functionality, and device_array handles its memory. Similar to the question on allocators, we could have an internal class which handles this functionality for both classes and "turn off" these public features for a no-frills device_array API. I think that is probably the best option.
- **Should async overloads be in the initial proposal or deferred?** This provides more control over synchronization than merely an in-order queue, but it is unclear whether users who want this would just work with USM memory and memcpy directly.
In my opinion, everything that is outside of the "minimally viable" scope should be deferred. Async operations seem to be there, but some other operations might be as well.
Sure, we can cut this down to a minimal version.
// Single-element host access (blocking, creates queue from context & device)
T read(size_type pos) const;
void write(size_type pos, const T& value);

// Single-element host access (blocking, provided queue is used for copy submissions)
T read(size_type pos, sycl::queue q) const;
void write(size_type pos, const T& value, sycl::queue q);
I would call these host_{read,write} for clarity.
`device_array` is not device-copyable (it owns memory). For kernel capture, non-owning views, and range composition, use `device_span<T>` via `.span()`.
I do not quite understand why "it owns memory" means it is not device-copyable. The memory for data is clearly not embedded into the class layout, so the ownership is logical, not physical.
I can update this language, the technical reason is that a context is owned here. You are correct that the ownership is logical rather than physical. This class is not meant to be used directly on the device, but rather via a span or a raw pointer.
This class is not meant to be used directly on the device, but rather via a span or a raw pointer.
And this is what bothers me. We set usage restrictions based on the anticipated implementation details, while ideally it should be the opposite.
I really would like this container to be usable as a sized random-access range on the device, so that its use with oneDPL algorithms would be natural and require no special support. For that, it should be device copyable. For that, we should either avoid embedding types that are not device copyable, or just ignore their existence and claim that it is device copyable anyway, designing the rest in such a way that operations allowed on a device never touch the parts that are not supposed to be there.
Well, without some exterior system of storing contexts (in a global registry for instance), there are technical limitations within SYCL that make this challenging.
A (non device-copyable) context is the minimum requirement to do transfers to and from the device and to deallocate memory. So either we store the context here, or we must have a system for storing contexts globally and storing pointers, or searching through a stored list. I really don't like the idea of having such a global registry, but it is something we could do. Perhaps you have some other idea to get around this.
I don't think it's necessarily bad or difficult to have this as the RAII "owning" container, and to have lightweight non-owning ranges / iterators like a span or raw USM pointer when we want to pass them around for use in a kernel or a oneDPL API. I think this separation is a feature, not a burden, in that it separates the scope of ownership from the usage. The same is true of device_vector: it is not used directly in CUDA kernels, but rather after getting a non-owning device_ptr, or very commonly by extracting the raw pointer and using that directly.
I much prefer thinking of device_array more similar to a std::vector than to a std::shared_ptr with reference counting or something like that.
So either we store the context here, or must have a system for storing them globally and storing pointers, or searching through a stored list.
Why not just store it in host-allocated memory, though? As far as I understand, the extra overhead would not be on the hot path, and would likely be negligible compared to the overhead of allocating USM and transferring the actual data.
I much prefer thinking of device_array more similar to a std::vector
I fully agree. Now compare:
std::vector v(/*some args*/);
std::for_each(v.begin(), v.end(), lambda); // works
std::ranges::for_each(v, lambda); // works
tbb::parallel_for(0, v.size(), [&v](int i){ foo(v[i]); }); // works

dpl::device_array da(/*some args*/);
dpl::device_policy policy(q);
dpl::for_each(policy, da.begin(), da.end(), lambda); // works
dpl::ranges::for_each(policy, da, lambda); // does not work, requires span or alike
q.parallel_for(da.size(), [=](sycl::id<1> i){ foo(da[i]); }); // does not work, requires da.data()
Yes, your code example is compelling, but the TBB / vector combination does not include a host / device relationship where ownership of host-side entities becomes challenging. Also, your final example uses the subscript operator of device_array on the device, which the current design does not allow for (it omits this operator). I think offering a subscript operator on this container would be very confusing if it did not provide implicit host-device transfers similar to device_vector. My intention with device_array is that it provides more intentional and direct control over host vs device usage, while keeping most of the convenience of device_vector.
Please correct me if I misunderstand the suggestion, but from a technical standpoint, I think you are suggesting that when constructing a device_array we allocate some "host_state" struct which includes context, and only hold a pointer to that, allowing device_array to stay device-copyable. However, something eventually needs to destroy that host_state. Is it the destructor (after reference counting)?
Let's look at the rules for device copyable:
Type T has at least one eligible copy constructor, move constructor, copy assignment operator, or move assignment operator;
Each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is public;
The effect of each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is the same as a bitwise copy of the object;
Type T has a public non-deleted destructor; and
The destructor has no effect.
The copy and move constructors must be equivalent to bitwise copies, and the destructor should have no effect. We could maybe make this the case on the device by turning off aspects of the container with __SYCL_DEVICE_ONLY__, but I'm not sure it fits with device copyable. Even if we do, we are probably reference counting on the host side to see who is responsible for destroying the host_state and the device memory, which I'd prefer to avoid.
This is why I am in favor of having a simple host-side container for lifetime management, and a separate lightweight non-owning layer for actual usage (relying on existing usm pointer and sycl::span).
I prefer to see an object that is passed to the device as a shadow copy of the object on the host, rather than an independent copy participating in the lifetime management. In other words, no reference counting involved, and instead we say that the host instance of the class is solely responsible for lifetime management and should remain alive for as long as any device code may access the data in any possible way, including via the USM pointers. That is the same requirement you have to have anyway, just extended to the shadow copies of the object as well.
And with that documented semantic restriction, making a shadow copy is equivalent to a bitwise copy, and destroying a shadow copy has no effect. In other words, if the lifetime control conditions are met in the user code, the object is de facto device copyable - that is, a SYCL implementation can safely copy the bits back and forth as it wants.
This RFC proposes adding an experimental `device_array` - a RAII container for USM device memory. It gives users a focused interface for managing device allocations and provides the most-travelled usage patterns of `device_vector`.

The RFC also proposes adding a `compat` namespace and a `device_vector` within it. `device_vector` wraps `device_array` and provides a similar set of functionality to Thrust's `device_vector`.

Four documents:

- `device_array` RFC
- `device_vector` and helper classes RFC
- Study of `thrust::device_vector` and `dpct::device_vector` usages, includes analysis of alternatives and why they diverged from Thrust